Abstract

Most research related to unithood was conducted as part of a larger
effort for the determination of termhood. Consequently, novelties are
rare in this small sub-field of term extraction. In addition, existing
work was mostly empirically motivated and derived. We propose a new
probabilistically-derived measure, independent of any influences of
termhood, that provides dedicated measures to gather linguistic
evidence from parsed text and statistical evidence from the Google
search engine for the measurement of unithood. Our comparative study
using 1,825 test cases against an existing empirically-derived
function revealed an improvement in terms of precision, recall and
accuracy.

1 Introduction 

Automatic term recognition, also referred to as term extraction or
terminology mining, is the process of extracting lexical units from
text and filtering them for the purpose of identifying terms which
characterise certain domains of interest. This process involves the
determination of two factors: unithood and termhood. Unithood concerns
whether or not a sequence of words should be combined to form a more
stable lexical unit. On the other hand, termhood measures the degree
to which these stable lexical units are related to domain-specific
concepts. Unithood is only relevant to complex terms (i.e. multi-word
terms) while termhood (Wong et al., 2007a) deals with both simple
terms (i.e. single-word terms) and complex terms. Recent reviews by
Wong et al. (2007b) show that existing research on unithood is mostly
carried out as a prerequisite to the determination of termhood. As a
result, there is only a small number of existing measures dedicated to
determining unithood. Besides the lack of dedicated attention in this
sub-field of term extraction, the existing measures are usually
derived from term or document frequency, and are modified as per need.
As such, the significance of the different weights that compose the
measures usually assumes an empirical viewpoint. Obviously, such
methods are at most inspired by, but not derived from, formal models
(Kageura and Umino, 1996).

The three objectives of this paper are (1) to separate the measurement
of unithood from the determination of termhood, (2) to devise a
probabilistically-derived measure which requires only one threshold
for determining the unithood of word sequences using non-static
textual resources, and (3) to demonstrate the superior performance of
the new probabilistically-derived measure against existing empirical
measures. With regard to the first objective, we will derive our
probabilistic measure free from any influence of termhood
determination. Following this, our unithood measure will be an
independent tool that is applicable not only to term extraction, but
also to many other tasks in information extraction and text mining.
Concerning the second objective, we will devise our new measure, known
as the Odds of Unithood (OU), which is derived using Bayes' Theorem
and founded on a few elementary probabilities. The probabilities are
estimated using Google page counts in an attempt to eliminate problems
related to the use of static corpora. Moreover, only one threshold,
namely $OU_T$, is required to control the functioning of OU. Regarding
the third objective, we will compare our new OU against an existing
empirically-derived measure called Unithood (UH) (Wong et al., 2007b)
in terms of their precision, recall and accuracy.

In Section 2, we provide a brief review of some of the existing
techniques for measuring unithood. In Section 3, we present our new
probabilistic approach, the measures involved, and the theoretical and
intuitive justification behind every aspect of our measures. In
Section 4, we summarize some findings from our evaluations. Finally,
we conclude this paper with an outlook to future work in Section 5.

2 Related Works 

Some of the most common measures of unithood include pointwise mutual
information (MI) (Church and Hanks, 1990) and log-likelihood ratio
(Dunning, 1994). In mutual information, the co-occurrence frequencies
of the constituents of complex terms are utilised to measure their
dependency. The mutual information for two words a and b is defined
as:

$$
MI(a, b) = \log_2 \frac{p(a, b)}{p(a)\,p(b)} \tag{1}
$$



where $p(a)$ and $p(b)$ are the probabilities of occurrence of a and
b. Many measures that apply statistical techniques assuming strict
normal distribution, and independence between the word occurrences
(Franz, 1997), do not fare well. For handling extremely uncommon words
or small sized corpora, log-likelihood ratio delivers the best
precision (Kurz and Xu, 2002). Log-likelihood ratio attempts to
quantify how much more likely one pair of words is to occur compared
to the others. Despite its potential, "How to apply this statistic
measure to quantify structural dependency of a word sequence remains
an interesting issue to explore." (Kit, 2002). Seretan et al. (2004)
tested mutual information, log-likelihood ratio and t-tests to examine
the use of results from web search engines for determining the
collocational strength of word pairs. However, no performance results
were presented.
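As a small illustration of Equation 1, the following Python sketch computes pointwise mutual information from raw counts. The function name, the argument names and the use of simple relative frequencies as probability estimates are assumptions made for this example and are not prescribed by the works cited above.

```python
import math


def pointwise_mutual_information(count_a, count_b, count_ab, total):
    """MI(a, b) = log2( p(a, b) / (p(a) * p(b)) ), as in Equation 1.

    Probabilities are estimated as relative frequencies (an assumption of
    this sketch): count_a and count_b are the occurrences of a and b,
    count_ab is their co-occurrence count, total is the sample size.
    """
    if min(count_a, count_b, count_ab) <= 0 or total <= 0:
        raise ValueError("all counts must be positive")
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    return math.log2(p_ab / (p_a * p_b))
```

A value of 0 means that a and b co-occur exactly as often as independence would predict, while larger positive values indicate stronger dependency between the two words.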

Wong et al. (2007b) presented a hybrid approach inspired by mutual
information in Equation 1 and C-value in Equation 3. The authors
employ Google page counts for the computation of statistical evidence
to replace the use of frequencies obtained from static corpora. Using
the page counts, the authors proposed a function known as Unithood
(UH) for determining the mergeability of two lexical units $a_x$ and
$a_y$ to produce a stable sequence of words s. The word sequences are
organised as a set $W = \{s, a_x, a_y\}$ where $s = a_x b a_y$ is a
term candidate, b can be any preposition, the coordinating conjunction
"and" or an empty string, and $a_x$ and $a_y$ can either be noun
phrases in the form Adj*N+ or another s (i.e. defining a new s in
terms of other s). The authors define UH as:

$$
UH(a_x, a_y) =
\begin{cases}
1 & \text{if } (MI(a_x, a_y) > MI^{+}) \;\lor \\
  & \quad (MI^{+} > MI(a_x, a_y) > MI^{-} \;\land \\
  & \quad\; ID(a_x, s) > ID_T \;\land \\
  & \quad\; ID(a_y, s) > ID_T \;\land \\
  & \quad\; IDR^{+} > IDR(a_x, a_y) > IDR^{-}) \\
0 & \text{otherwise}
\end{cases}
\tag{2}
$$

where $MI^{+}$, $MI^{-}$, $ID_T$, $IDR^{+}$ and $IDR^{-}$ are
thresholds for determining mergeability decisions, and $MI(a_x, a_y)$
is the mutual information between $a_x$ and $a_y$, while
$ID(a_x, s)$, $ID(a_y, s)$ and $IDR(a_x, a_y)$ are measures of lexical
independence of $a_x$ and $a_y$ from s. For brevity, let z be either
$a_x$ or $a_y$, and the independence measure $ID(z, s)$ is then
defined as:

$$
ID(z, s) =
\begin{cases}
\log_{10}(n_z - n_s) & \text{if } (n_z > n_s) \\
0 & \text{otherwise}
\end{cases}
$$

where $n_z$ and $n_s$ are the Google page counts for z and s
respectively. On the other hand,
$IDR(a_x, a_y) = \frac{ID(a_x, s)}{ID(a_y, s)}$. Intuitively,
$UH(a_x, a_y)$ states that the two lexical units $a_x$ and $a_y$ can
only be merged in two cases, namely, 1) if $a_x$ and $a_y$ have
extremely high mutual information (i.e. higher than a certain
threshold $MI^{+}$), or 2) if $a_x$ and $a_y$ achieve average mutual
information (i.e. within the acceptable range of two thresholds
$MI^{+}$ and $MI^{-}$) due to both of their extremely high
independence (i.e. higher than the threshold $ID_T$) from s.



Frantzi (1997) proposed a measure known as Cvalue for extracting
complex terms. The measure is based upon the claim that a substring of
a term candidate is a candidate itself given that it demonstrates
adequate independence from the longer version it appears in. For
example, "E. coli food poisoning", "E. coli" and "food poisoning" are
acceptable as valid complex term candidates. However, "E. coli food"
is not. Therefore, some measures are required to gauge the strength of
word combinations to decide whether two word sequences should be
merged or not. Given a word sequence a to be examined for unithood,
the Cvalue is defined as:



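$$
Cvalue(a) =
\begin{cases}
\log_2 |a| \, f_a & \text{if } |a| = g \\
\log_2 |a| \left( f_a - \dfrac{\sum_{l \in L_a} f_l}{|L_a|} \right) & \text{otherwise}
\end{cases}
\tag{3}
$$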

where $|a|$ is the number of words in a, $L_a$ is the set of longer
term candidates that contain a, g is the longest n-gram considered,
$f_a$ is the frequency of occurrence of a, and $a \notin L_a$. While
certain researchers (Kit, 2002) consider Cvalue as a termhood measure,
others (Nakagawa and Mori, 2002) accept it as a measure for unithood.
One can observe that longer candidates tend to gain higher weights due
to the inclusion of $\log_2 |a|$ in Equation 3. In addition, the
weights computed using Equation 3 are purely dependent on the
frequency of a.
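To make the two cases of Equation 3 concrete, here is a minimal Python sketch. The function signature, the whitespace-based word count and the fallback to the first case when no longer candidate contains a are assumptions of this example rather than details given by Frantzi (1997).

```python
import math


def c_value(candidate, freq, longer_candidates, longest_n):
    """Cvalue(a) following Equation 3.

    candidate         -- the word sequence a
    freq              -- dict mapping each candidate to its frequency f
    longer_candidates -- the set L_a of longer candidates containing a
    longest_n         -- g, the longest n-gram considered
    """
    length = len(candidate.split())  # |a|, counted on whitespace (assumption)
    f_a = freq[candidate]
    # Assumption: with an empty L_a there is nothing to subtract, so the
    # first case of Equation 3 is used to avoid a division by zero.
    if length == longest_n or not longer_candidates:
        return math.log2(length) * f_a
    nested = sum(freq[l] for l in longer_candidates) / len(longer_candidates)
    return math.log2(length) * (f_a - nested)
```

With hypothetical frequencies of 10 for "food poisoning" and 4 for "E. coli food poisoning", the shorter candidate is discounted by the frequency with which it appears nested inside the longer one.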

3 A Probabilistically-derived Measure for 
Unithood Determination 

We propose a probabilistically-derived measure for determining the
unithood of word pairs (i.e. potential term candidates) extracted
using the head-driven left-right filter (Wong, 2005; Wong et al.,
2007b) and Stanford Parser (Klein and Manning, 2003). These word pairs
will appear in the form of $(a_x, a_y) \in A$ with $a_x$ and $a_y$
located immediately next to each other (i.e. $x + 1 = y$), or
separated by a preposition or coordinating conjunction "and" (i.e.
$x + 2 = y$). Obviously, $a_x$ has to appear before $a_y$ in the
sentence or in other words, $x < y$ for all pairs where x and y are
the word offsets produced by the Stanford Parser. The pairs in A will
remain as potential term candidates until their unithood has been
examined. Once the unithood of the pairs in A has been determined,
they will be referred to as term candidates. Formally, the unithood of
any two lexical units $a_x$ and $a_y$ can be defined as:

Definition 1 The unithood of two lexical units is the "degree of
strength or stability of syntagmatic combinations and collocations"
(Kageura and Umino, 1996) between them.

It is obvious that the problem of measuring the unithood of any pair
of words is the determination of their "degree" of collocational
strength as mentioned in Definition 1. In practical terms, the
"degree" mentioned above will provide us with a way to determine if
the units $a_x$ and $a_y$ should be combined to form s, or left alone
as separate units. The collocational strength of $a_x$ and $a_y$ that
exceeds a certain threshold will demonstrate to us that s is able to
form a stable unit and hence, a better term candidate than $a_x$ and
$a_y$ separated. It is worth pointing out that the size (i.e. number
of words) of $a_x$ and $a_y$ is not limited to 1. For example, we can
have $a_x$ = "National Institute", b = "of" and $a_y$ = "Allergy and
Infectious Diseases". In addition, the size of $a_x$ and $a_y$ has no
effect on the determination of their unithood using our approach.

As we have discussed in Section 2, most of the conventional practices
employ frequency of occurrence from local corpora, and some
statistical tests or information-theoretic measures to determine the
coupling strength between elements in $W = \{s, a_x, a_y\}$. Two of
the main problems associated with such approaches are:



• Data sparseness is a problem that is well-documented by many
researchers (Keller et al., 2002). It is inherent to the use of local
corpora that can lead to poor estimation of parameters or weights; and

• Assumption of independence and normality of word distribution are
two of the many problems in language modelling (Franz, 1997). While
the independence assumption reduces text to simply a bag of words, the
assumption of normal distribution of words will often lead to
incorrect conclusions during statistical tests.

As a general solution, we innovatively employ re- 
sults from web search engines for use in a proba- 
bilistic framework for measuring unithood. 



As an attempt to address the first problem, we utilise page counts by
Google for estimating the probability of occurrences of the lexical
units in W. We consider the World Wide Web as a large general corpus
and the Google search engine as a gateway for accessing the documents
in the general corpus. Our choice of using Google to obtain the page
count was merely motivated by its extensive coverage. In fact, it is
possible to employ any search engine on the World Wide Web for this
research. As for the second issue, we attempt to address the problem
of determining the degree of collocational strength in terms of
probabilities estimated using Google page count. We begin by defining
the sample space, N, as the set of all documents indexed by the Google
search engine. We can estimate the index size of Google, $|N|$, using
function words as predictors. Function words such as "a", "is" and
"with", as opposed to content words, appear with frequencies that are
relatively stable over many different genres. Next, we perform random
draws (i.e. trials) of documents from N. For each lexical unit
$w \in W$, there will be a corresponding set of outcomes (i.e. events)
from the draw. There will be three basic sets which are of interest to
us:

Definition 2 Basic events corresponding to each $w \in W$:

• X is the event that $a_x$ occurs in the document

• Y is the event that $a_y$ occurs in the document

• S is the event that s occurs in the document

It should be obvious to the readers that since the documents in S have
to contain both units $a_x$ and $a_y$, S is a subset of $X \cap Y$, or
$S \subseteq X \cap Y$. It is worth noting that even though
$S \subseteq X \cap Y$, it is highly unlikely that $S = X \cap Y$
since the two portions $a_x$ and $a_y$ may exist in the same document
without being conjoined by b. Next, subscribing to the frequency
interpretation of probability, we can obtain the probability of the
events in Definition 2 in terms of Google page count:



$$
P(X) = \frac{n_x}{|N|}, \qquad
P(Y) = \frac{n_y}{|N|}, \qquad
P(S) = \frac{n_s}{|N|}
\tag{4}
$$

where $n_x$, $n_y$ and $n_s$ are the page counts returned as the
result of Google search using the terms [+"$a_x$"], [+"$a_y$"] and
[+"s"], respectively. The pair of quotes that encapsulates the search
terms is the phrase operator, while the character "+" is the required
operator supported by the Google search engine. As discussed earlier,
the independence assumption required by certain information-theoretic
measures and other Bayesian approaches may not always be valid,
especially when we are dealing with linguistics. As such,
$P(X \cap Y) \neq P(X)P(Y)$ since the occurrences of $a_x$ and $a_y$
in documents are inevitably governed by some hidden variables and
hence, not independent. Following this, we define the probabilities
for two new sets which result from applying some set operations on the
basic events in Definition 2:



$$
\begin{aligned}
P(X \cap Y) &= \frac{n_{xy}}{|N|} \\
P(X \cap Y \setminus S) &= P(X \cap Y) - P(S)
\end{aligned}
\tag{5}
$$



where $n_{xy}$ is the page count returned by Google for the search
using [+"$a_x$" +"$a_y$"]. Defining $P(X \cap Y)$ in terms of
observable page counts, rather than a combination of two independent
events, will allow us to avoid any unnecessary assumption of
independence.
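The elementary probabilities in Equations 4 and 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `PageCounts` container, the function name and the clamping of a negative difference to zero (page counts returned by a search engine are estimates and may be inconsistent) are hypothetical choices, and the page counts are taken as already retrieved rather than fetched from any search API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageCounts:
    """Page counts for one candidate s = a_x b a_y (hypothetical container)."""
    n_x: int         # pages matching [+"a_x"]
    n_y: int         # pages matching [+"a_y"]
    n_s: int         # pages matching [+"s"]
    n_xy: int        # pages matching [+"a_x" +"a_y"]
    index_size: int  # |N|, the estimated size of the index


def elementary_probabilities(c):
    """Return P(X), P(Y), P(S), P(X∩Y) and P(X∩Y\\S) as a dict."""
    if c.index_size <= 0:
        raise ValueError("index_size must be positive")
    p_x = c.n_x / c.index_size
    p_y = c.n_y / c.index_size
    p_s = c.n_s / c.index_size
    p_xy = c.n_xy / c.index_size
    # Equation 5; clamped at 0 as a guard against noisy counts (assumption).
    p_xy_minus_s = max(p_xy - p_s, 0.0)
    return {"X": p_x, "Y": p_y, "S": p_s, "XY": p_xy, "XY_minus_S": p_xy_minus_s}
```

Note that $P(X \cap Y)$ is read directly from the observed count $n_{xy}$ and is never computed as the product of `p_x` and `p_y`, which mirrors the decision to avoid the independence assumption.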

Next, referring back to our main problem discussed in Definition 1, we
are required to estimate the strength of collocation of the two units
$a_x$ and $a_y$. Since there is no standard metric for such
measurement, we propose to address the problem from a probabilistic
perspective. We introduce the probability that s is a stable lexical
unit given the evidence s possesses:

Definition 3 Probability of unithood:

$$
P(U|E) = \frac{P(E|U)\,P(U)}{P(E)}
$$

where U is the event that s is a stable lexical unit and E is the
evidences belonging to s. $P(U|E)$ is the posterior probability that s
is a stable unit given the evidence E. $P(U)$ is the prior probability
that s is a unit without any evidence, and $P(E)$ is the prior
probability of evidences held by s. As we shall see later, these two
prior probabilities will be immaterial in the final computation of
unithood. Since s can either be a stable unit or not, we can state
that,

$$
P(\bar{U}|E) = 1 - P(U|E) \tag{6}
$$

where $\bar{U}$ is the event that s is not a stable lexical unit.
Since Odds = P/(1 - P), we multiply both sides of Definition 3 by
$(1 - P(U|E))^{-1}$ to obtain,

$$
\frac{P(U|E)}{1 - P(U|E)} = \frac{P(E|U)\,P(U)}{P(E)\,(1 - P(U|E))} \tag{7}
$$

By substituting Equation 6 in Equation 7 and later, applying the
multiplication rule $P(\bar{U}|E)\,P(E) = P(E|\bar{U})\,P(\bar{U})$ to
it, we will obtain:

$$
\frac{P(U|E)}{P(\bar{U}|E)} = \frac{P(E|U)\,P(U)}{P(E|\bar{U})\,P(\bar{U})} \tag{8}
$$

We proceed to take the log of the odds in Equation 8 (i.e. logit) to
get:

$$
\log \frac{P(E|U)}{P(E|\bar{U})} = \log \frac{P(U|E)}{P(\bar{U}|E)} - \log \frac{P(U)}{P(\bar{U})} \tag{9}
$$

While it is obvious that certain words tend to co-occur more
frequently than others (i.e. idioms and collocations), such phenomena
are largely arbitrary (Smadja, 1993). This makes the task of deciding
on what constitutes an acceptable collocation difficult. The only way
to objectively identify stable lexical units is through observations
in samples of the language (e.g. text corpus) (McKeown and Radev,
2000). In other words, assigning the a priori probability of
collocational strength without empirical evidence is both subjective
and difficult. As such, we are left with the option to assume that the
probability of s being a stable unit and not being a stable unit
without evidence is the same (i.e. $P(U) = P(\bar{U}) = 0.5$). As a
result, the second term in Equation 9 evaluates to 0:



$$
\log \frac{P(U|E)}{P(\bar{U}|E)} = \log \frac{P(E|U)}{P(E|\bar{U})} \tag{10}
$$



We introduce a new measure for determining the 
odds of s being a stable unit known as Odds of Unit- 
hood (OU): 

Definition 4 Odds of unithood

$$
OU(s) = \log \frac{P(E|U)}{P(E|\bar{U})}
$$

(a) The area with darker shade is the set $X \cap Y \setminus S$.
Computing the ratio of $P(S)$ and the probability of this area will
give us the first evidence.

(b) The area with darker shade is the set $S'$. Computing the ratio of
$P(S)$ and the probability of this area (i.e. $P(S') = 1 - P(S)$) will
give us the second evidence.

Figure 1: The probabilities of the areas with darker shade are the
denominators required by the evidences $e_1$ and $e_2$ for the
estimation of $OU(s)$.



Assuming that the evidences in E are independent of one another, we
can evaluate $OU(s)$ in terms of:

$$
OU(s) = \log \frac{\prod_i P(e_i|U)}{\prod_i P(e_i|\bar{U})}
      = \sum_i \log \frac{P(e_i|U)}{P(e_i|\bar{U})}
\tag{11}
$$

where $e_i$ are individual evidences possessed by s.

With the introduction of Definition 4, we can examine the degree of
collocational strength of $a_x$ and $a_y$ in forming s, mentioned in
Definition 1, in terms of $OU(s)$. With the base of the log in
Definition 4 greater than 1, the upper and lower bounds of $OU(s)$
would be $+\infty$ and $-\infty$, respectively. $OU(s) = +\infty$ and
$OU(s) = -\infty$ correspond to the highest and the lowest degree of
stability of the two units $a_x$ and $a_y$ appearing as s,
respectively. A high¹ $OU(s)$ would indicate the suitability for the
two units $a_x$ and $a_y$ to be merged to form s. Ultimately, we have
reduced the vague problem of the determination of unithood introduced
in Definition 1 into a practical and computable solution in Definition
4. The evidences that we propose to employ for determining unithood
are based on the occurrences of s, or the event S from Definition 2.
We are interested in two types of occurrences of s, namely, the
occurrence of s given that $a_x$ and $a_y$ have already occurred (i.e.
$X \cap Y$), and the occurrence of s as it is in our sample space, N.
We refer to the first evidence $e_1$ as local occurrence, and to the
second evidence $e_2$ as global occurrence. We will discuss the
intuitive justification behind each type of occurrence. Each evidence
$e_i$ captures the occurrences of s within a different confinement. We
will estimate these evidences in terms of the elementary probabilities
already defined in Equations 4 and 5.

¹ A subjective issue that may be determined using a threshold.

The first evidence $e_1$ captures the probability of occurrences of s
within the confinement of $a_x$ and $a_y$, or $X \cap Y$. As such,
$P(e_1|U)$ can be interpreted as the probability of s occurring within
$X \cap Y$ as a stable unit, or $P(S|X \cap Y)$. On the other hand,
$P(e_1|\bar{U})$ captures the probability of s occurring in
$X \cap Y$ not as a unit. In other words, $P(e_1|\bar{U})$ is the
probability of s not occurring in $X \cap Y$, or equivalently, equal
to $P((X \cap Y \setminus S)|(X \cap Y))$. The set
$X \cap Y \setminus S$ is shown as the area with darker shade in
Figure 1(a).



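Let us define the odds based on the first evidence as:

$$
O_L = \frac{P(e_1|U)}{P(e_1|\bar{U})} \tag{12}
$$

Substituting $P(e_1|U) = P(S|X \cap Y)$ and
$P(e_1|\bar{U}) = P((X \cap Y \setminus S)|(X \cap Y))$ into Equation
12 will give us:

$$
\begin{aligned}
O_L &= \frac{P(S|X \cap Y)}{P((X \cap Y \setminus S)|(X \cap Y))} \\
    &= \frac{P(S \cap (X \cap Y))}{P(X \cap Y)} \cdot
       \frac{P(X \cap Y)}{P((X \cap Y \setminus S) \cap (X \cap Y))} \\
    &= \frac{P(S \cap (X \cap Y))}{P((X \cap Y \setminus S) \cap (X \cap Y))}
\end{aligned}
$$

and since $S \subseteq (X \cap Y)$ and
$(X \cap Y \setminus S) \subseteq (X \cap Y)$,

$$
O_L = \frac{P(S)}{P(X \cap Y \setminus S)}
\quad \text{if } P(X \cap Y \setminus S) \neq 0
$$

and $O_L = 1$ if $P(X \cap Y \setminus S) = 0$.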

The second evidence $e_2$ captures the probability of occurrences of s
without confinement. If s is a stable unit, then its probability of
occurrence in the sample space would simply be $P(S)$. On the other
hand, if s occurs not as a unit, then its probability of
non-occurrence is $1 - P(S)$. The complement of S, which is the set
$S'$, is shown as the area with darker shade in Figure 1(b). Let us
define the odds based on the second evidence as:

$$
O_G = \frac{P(e_2|U)}{P(e_2|\bar{U})} \tag{13}
$$

Substituting $P(e_2|U) = P(S)$ and $P(e_2|\bar{U}) = 1 - P(S)$ into
Equation 13 will give us:

$$
O_G = \frac{P(S)}{1 - P(S)}
$$



Intuitively, the first evidence attempts to capture the extent to
which the existence of the two lexical units $a_x$ and $a_y$ is
attributable to s. Referring back to $O_L$, whenever the denominator
$P(X \cap Y \setminus S)$ becomes less than $P(S)$, we can deduce that
$a_x$ and $a_y$ actually exist together as s more than in other forms.
At one extreme, when $P(X \cap Y \setminus S) = 0$, we can conclude
that the co-occurrence of $a_x$ and $a_y$ is exclusively for s. As
such, we can also refer to $O_L$ as a measure of exclusivity for the
use of $a_x$ and $a_y$ with respect to s. This first evidence is a
good indication for the unithood of s since the more the existence of
$a_x$ and $a_y$ is attributed to s, the stronger the collocational
strength of s becomes. Concerning the second evidence, $O_G$ attempts
to capture the extent to which s occurs in general usage (i.e. World
Wide Web). We can consider $O_G$ as a measure of pervasiveness for the
use of s. As s becomes more widely used in text, the numerator in
$O_G$ will increase. This provides a good indication of the unithood
of s since the more s appears in usage, the likelier it becomes that s
is a stable unit instead of an occurrence by chance when $a_x$ and
$a_y$ are located next to each other. As a result, the derivation of
$OU(s)$ using $O_L$ and $O_G$ will ensure a comprehensive way of
determining unithood.

Finally, expanding $OU(s)$ in Equation 11 using Equations 12 and 13
will give us:

$$
\begin{aligned}
OU(s) &= \log O_L + \log O_G \\
      &= \log \frac{P(S)}{P(X \cap Y \setminus S)} + \log \frac{P(S)}{1 - P(S)}
\end{aligned}
\tag{14}
$$

As such, the decision on whether $a_x$ and $a_y$ should be merged to
form s can be made based solely on the Odds of Unithood (OU) defined
in Equation 14. We will merge $a_x$ and $a_y$ if their odds of
unithood exceed a certain threshold, $OU_T$.
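The merging decision based on Equation 14 can be sketched in Python as below. Several details are assumptions of this example and not statements from the paper: the function names, the log base of 10 (the text only requires a base greater than 1), the explicit rejection of inconsistent counts, and the treatment of the index size $|N|$ as a value supplied by the caller. Page counts are assumed to have been retrieved already.

```python
import math


def odds_of_unithood(n_s, n_xy, index_size, log_base=10.0):
    """OU(s) = log O_L + log O_G, following Equation 14.

    n_s        -- page count for [+"s"]
    n_xy       -- page count for [+"a_x" +"a_y"]
    index_size -- |N|, the estimated number of indexed documents
    log_base   -- assumed to be 10 here; any base greater than 1 works
    """
    if not 0 < n_s < index_size:
        raise ValueError("n_s must satisfy 0 < n_s < index_size")
    if n_xy < n_s:
        # S is a subset of X∩Y, so consistent counts need n_xy >= n_s.
        raise ValueError("n_xy must be at least n_s")
    p_s = n_s / index_size
    p_xy_minus_s = (n_xy - n_s) / index_size  # P(X∩Y \ S), Equation 5
    # O_L is defined as 1 when P(X∩Y \ S) = 0.
    o_l = 1.0 if p_xy_minus_s == 0 else p_s / p_xy_minus_s
    o_g = p_s / (1.0 - p_s)
    return math.log(o_l, log_base) + math.log(o_g, log_base)


def should_merge(n_s, n_xy, index_size, threshold):
    """Merge a_x and a_y into s when OU(s) exceeds the threshold OU_T."""
    return odds_of_unithood(n_s, n_xy, index_size) > threshold
```

Because only $n_s$, $n_{xy}$ and $|N|$ enter Equation 14, the individual counts $n_x$ and $n_y$ are not needed for the final decision, and a single threshold controls the outcome.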

4 Evaluations and Discussions 

For this evaluation, we employed 500 news articles from Reuters in the
health domain gathered between December 2006 and May 2007. These 500
articles are fed into the Stanford Parser whose output is then used by
our head-driven left-right filter (Wong, 2005; Wong et al., 2007b) to
extract word sequences in the form of nouns and noun phrases. Pairs of
word sequences (i.e. $a_x$ and $a_y$) located immediately next to each
other, or separated by a preposition or the conjunction "and" in the
same sentence, are measured for their unithood. Using the 500 news
articles, we managed to obtain 1,825 pairs of words to be tested for
unithood.

We performed a comparative study of our new probabilistic approach
against the empirically-derived unithood function described in
Equation 2. Two experiments were conducted. In the first one, we
assessed our probabilistically-derived measure $OU(s)$ as described in
Equation 14, where the decisions on whether or not to merge the 1,825
pairs are made automatically. These decisions are known as the actual
results. At the same time, we inspected the same list manually to
decide on the merging of all the pairs. These decisions are known as
the ideal results. The threshold $OU_T$ employed for our evaluation is
determined empirically through experiments and is set to $-8.39$.
However, since only one threshold is involved in deciding
mergeability, training algorithms and data sets may be employed to
automatically decide on an optimal number. This option is beyond the
scope of this paper. The actual and ideal results for this first
experiment are organised into a contingency table (not shown here) for
identifying the true and the false positives, and the true and the
false negatives. In the second experiment, we conducted the same
assessment as carried out in the first one but the decisions to merge
the 1,825 pairs are based on the $UH(a_x, a_y)$ function described in
Equation 2. The thresholds required for this function are based on the
values suggested by Wong et al. (2007b), namely, $MI^{+} = 0.9$,
$MI^{-} = 0.02$, $ID_T = 6$, $IDR^{+} = 1.35$, and $IDR^{-} = 0.93$.

Using the results from the contingency tables, we computed the
precision, recall and accuracy for the two measures under evaluation.
Table 1 summarises the performance of $OU(s)$ and $UH(a_x, a_y)$ in
determining the unithood of 1,825 pairs of lexical units. One will
notice that our new measure $OU(s)$ outperformed the
empirically-derived function $UH(a_x, a_y)$ in all aspects, with an
improvement of 2.63%, 3.33% and 2.74% for precision, recall and
accuracy, respectively. Our new measure achieved a 100% precision with
a lower recall at 95.83%. As with any measure that employs thresholds
as a cut-off point in accepting or rejecting certain decisions, we can
improve the recall of $OU(s)$ by decreasing the threshold $OU_T$. In
this way, there will be fewer false negatives (i.e. pairs which are
supposed to be merged but are not) and hence, a higher recall rate.
Unfortunately, recall will improve at the expense of precision since
the number of false positives can only increase from the existing 0.
Since our application (i.e. ontology learning) requires perfect
precision in determining the unithood of word sequences, $OU(s)$ is
the ideal candidate. Moreover, with only one threshold (i.e. $OU_T$)
required in controlling the function of $OU(s)$, we are able to reduce
the amount of time and effort spent on optimising our results.

Table 1: The performance of $OU(s)$ (from Experiment 1) and
$UH(a_x, a_y)$ (from Experiment 2) in terms of precision, recall and
accuracy. The last column shows the difference in the performance of
Experiment 1 and 2.

            Experiment 1    Experiment 2          Difference
            using OU(s)     using UH(a_x, a_y)    (Experiment 1 - Experiment 2)
Precision   100.00%         97.37%                2.63%
Recall      95.83%          92.50%                3.33%
Accuracy    97.26%          94.52%                2.74%
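For readers who wish to reproduce the three performance figures from a contingency table, a minimal Python sketch follows. The function name is hypothetical, and because the contingency tables are not shown in this paper, the counts in the usage note are illustrative values chosen only to be consistent with the reported percentages for Experiment 1; they are not the authors' published counts.

```python
def evaluate(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from contingency-table counts.

    tp/fp -- pairs merged by the measure that should / should not be merged
    fn/tn -- pairs left unmerged that should / should not have been merged
    """
    total = tp + fp + fn + tn
    if total == 0 or tp + fp == 0 or tp + fn == 0:
        raise ValueError("counts do not allow all three ratios to be computed")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / total
    return precision, recall, accuracy
```

For example, the illustrative counts `evaluate(tp=1150, fp=0, fn=50, tn=625)` sum to 1,825 pairs and yield a precision of 100%, a recall of about 95.83% and an accuracy of about 97.26%.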

5 Conclusion and Future Work 

In this paper, we highlighted the significance of unithood and argued
that its measurement should be given equal attention by researchers in
term extraction. We focused on the development of a new approach that
is independent of influences of termhood measurement. We proposed a
new probabilistically-derived measure which provides a dedicated way
to determine the unithood of word sequences. We refer to this measure
as the Odds of Unithood (OU). OU is derived using Bayes' Theorem and
is founded upon two evidences, namely, local occurrence and global
occurrence. Elementary probabilities estimated using page counts from
the Google search engine are utilised to quantify the two evidences.
The new probabilistically-derived measure OU is then evaluated against
an existing empirical function known as Unithood (UH). Our new measure
OU achieved a precision and a recall of 100% and 95.83% respectively,
with an accuracy of 97.26% in measuring the unithood of 1,825 test
cases. OU outperformed UH by 2.63%, 3.33% and 2.74% in terms of
precision, recall and accuracy, respectively. Moreover, our new
measure requires only one threshold, as compared to five in UH, to
control the mergeability decision.

More work is required to establish the coverage and the depth of the
World Wide Web with regard to the determination of unithood. While the
Web has demonstrated reasonable strength in handling general news
articles, we have yet to study its appropriateness in dealing with
unithood determination for technical text (i.e. the depth of the Web).
Similarly, the extent to which the Web is able to satisfy the
requirement of unithood determination for a wider range of genres
(i.e. the coverage of the Web) remains an open question. Studies on
the effect of noise (e.g. keyword spamming) and multiple word senses
on unithood determination using the Web are another future research
direction.